CUDIA: Probabilistic Cross-level Imputation using Individual Side Information
Authors
Abstract
Due to privacy or legal issues, publishing data in aggregate form is common practice in healthcare and medical research. However, data mining algorithms that try to recover valuable individual-level relationships from aggregate data either suffer from aggregation bias and information loss or require rather strict assumptions that are usually unverifiable. Furthermore, even when individual-level data are available, many healthcare studies are performed with a pre-specified goal, so the limited scope of variables constrains the range of research questions. How can one run data mining procedures on data in which different variables are available at different levels of aggregation or granularity? In this paper, we seek better utilization of variably aggregated datasets, possibly from different sources. By modeling the generative process of such datasets with a Bayesian directed graphical model, we propose a novel “cross-level” imputation technique. The imputation is based on the underlying data distribution and is shown to be unbiased. The imputed values can then be used in subsequent predictive modeling, yielding better performance than simply imputing the aggregate information as is. Experimental results on a simulated dataset and the Behavioral Risk Factor Surveillance System (BRFSS) dataset illustrate the generality and capabilities of the proposed framework.
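To make the contrast in the abstract concrete, the following is a minimal toy sketch in Python, not the CUDIA model itself. It assumes a variable `y` is published only as group means while a side variable `x` is observed for every individual, and it compares copying the group mean to every member against shifting that mean by each individual's deviation in `x`. The slope is estimated by a plain least-squares fit on the group means, which is exactly the kind of strong, hard-to-verify assumption the abstract warns about; the paper instead uses a Bayesian directed graphical model. All function names, parameters, and the simulated data are hypothetical.

```python
import random
from statistics import mean


def simulate(n_groups=20, group_size=50, slope=2.0, seed=0):
    """Toy data (an assumption): rows are (group, x, y_true)."""
    rng = random.Random(seed)
    rows = []
    for g in range(n_groups):
        center = rng.uniform(-3.0, 3.0)  # groups differ in their typical x
        for _ in range(group_size):
            x = center + rng.gauss(0.0, 1.0)
            y = 1.0 + slope * x + rng.gauss(0.0, 0.5)
            rows.append((g, x, y))
    return rows


def group_means(rows, idx):
    """Mean of column `idx` within each group."""
    sums, counts = {}, {}
    for row in rows:
        g = row[0]
        sums[g] = sums.get(g, 0.0) + row[idx]
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}


def fit_slope(xbar, ybar):
    """Least-squares slope of group-mean y on group-mean x."""
    groups = list(xbar)
    mx = mean(xbar[g] for g in groups)
    my = mean(ybar[g] for g in groups)
    num = sum((xbar[g] - mx) * (ybar[g] - my) for g in groups)
    den = sum((xbar[g] - mx) ** 2 for g in groups)
    return num / den


def impute(rows, ybar, xbar=None, slope=0.0):
    """Without xbar: copy the group mean. With xbar: shift it by slope * (x - group mean of x)."""
    out = []
    for g, x, _ in rows:
        shift = slope * (x - xbar[g]) if xbar is not None else 0.0
        out.append(ybar[g] + shift)
    return out


def mse(pred, rows):
    """Mean squared error against the true y, which is used for evaluation only."""
    return mean((p - r[2]) ** 2 for p, r in zip(pred, rows))


rows = simulate()
xbar = group_means(rows, 1)
ybar = group_means(rows, 2)  # the only form in which y is "published"
b = fit_slope(xbar, ybar)

naive = impute(rows, ybar)           # aggregate value used as is
side = impute(rows, ybar, xbar, b)   # aggregate value plus individual side information

print("naive MSE:", mse(naive, rows))
print("side-information MSE:", mse(side, rows))
```

In this sketch the side-information imputation still reproduces every published group mean, because the individual shifts sum to zero within each group, while recovering part of the within-group variation that the naive approach discards.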
Similar resources
Inbred Strain Variant Database (ISVdb): A Repository for Probabilistically Informed Sequence Differences Among the Collaborative Cross Strains and Their Founders
The Collaborative Cross (CC) is a panel of recently established multiparental recombinant inbred mouse strains. For the CC, as for any multiparental population (MPP), effective experimental design and analysis benefit from detailed knowledge of the genetic differences between strains. Such differences can be directly determined by sequencing, but until now whole-genome sequencing was not public...
Bias Characterization in Probabilistic Genotype Data and Improved Signal Detection with Multiple Imputation
Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred ex...
Estimation of genotype imputation accuracy using reference populations with varying degrees of relationship and marker density panel
Genotype imputation from low-density to high-density (SNP) chips is an important step before applying genomic selection, because denser chips can provide more reliable genomic predictions. In the current research, the accuracy of genotype imputation from low and moderate-density panels (5K and 50K) to high-density panels in the purebred and crossbred populations was assessed. The simulated popu...
Measuring Disclosure Risk for a Synthetic Data Set Created Using Multiple Methods
Government agencies must simultaneously maintain confidentiality of individual records and disseminate useful microdata. We propose a method to create synthetic data that combines quantile regression, hot deck imputation, and rank swapping. The result from implementation of the proposed procedure is a releasable data set containing original values for a few key variables, synthetic quantile reg...
A Probabilistic Imputation Framework for Regression Analysis using Variably Aggregated, Multi-source Healthcare Data
Many measures of healthcare delivery or quality are not publicly available at the individual patient or hospital level largely due to privacy restrictions, legal issues or reporting norms. Instead, such measures are provided at a higher or more aggregated level, such as state-level, county-level summaries or averages over health zones (HRRs and HSAs). Such levels constitute partitionings of the...
Publication date: 2011